
[Perf] Support Flashinfer RoPE+Quant+KV update kernel for trtllm_mha backend for GPT-OSS#15729

Open
elvischenv wants to merge 10 commits into sgl-project:main from elvischenv:elvischenv/gpt-oss_rope_quant_kv

Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Dec 24, 2025

Motivation

This PR adds support for the Flashinfer rope_quantize_fp8_append_paged_kv_cache kernel in the trtllm_mha backend and enables it for GPT-OSS.

Depends on a Flashinfer PR that fixes the piecewise CUDA graph compatibility issue: flashinfer-ai/flashinfer#2792
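Conceptually, the fused kernel collapses three previously separate steps into one pass over the new tokens: apply rotary embedding, quantize to FP8 e4m3, and write into the paged KV cache. The NumPy sketch below is a minimal reference for that data flow only; the function names, cache layout, and scaling are illustrative assumptions, not the Flashinfer API.

```python
# Reference sketch of rope -> fp8-quantize -> paged-KV-append in one pass.
# Names/layout are illustrative assumptions, NOT the Flashinfer kernel's API.
import numpy as np

def rope(x: np.ndarray, pos: int, theta: float = 10000.0) -> np.ndarray:
    """Apply rotary embedding to one token's head vector (rotate-half convention)."""
    half = x.shape[-1] // 2
    freqs = theta ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def quantize_fp8_e4m3(x: np.ndarray, scale: float) -> np.ndarray:
    """Simulate FP8 e4m3 quantization: scale, then clamp to e4m3's finite range."""
    return np.clip(x / scale, -448.0, 448.0)  # 448 is the e4m3 max finite value

def fused_rope_quant_append(k, pos, scale, kv_cache, page_table, page_size):
    """RoPE, quantize, and write into the paged KV cache in a single step."""
    k_q = quantize_fp8_e4m3(rope(k, pos), scale)
    kv_cache[page_table[pos // page_size], pos % page_size] = k_q
    return k_q
```

Doing all three in one kernel avoids materializing the rotated bf16 K and re-reading it for quantization and the cache write, which is where the TPOT gain comes from.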

Tested cmd

server:

SGLANG_ENABLE_FLASHINFER_ROPE_FUSION=1 \
sglang serve \
--model-path openai/gpt-oss-120b \
--tensor-parallel-size 8 \
--kv-cache-dtype fp8_e4m3 \
--max-running-requests 1024 \
--cuda-graph-max-bs 1024 \
--stream-interval 20 \
--disable-radix-cache \
--model-loader-extra-config '{"enable_multithread_load": true}'

server with eagle:

SGLANG_ENABLE_SPEC_V2=1 \
SGLANG_ENABLE_FLASHINFER_ROPE_FUSION=1 \
sglang serve \
--model-path openai/gpt-oss-120b \
--tensor-parallel-size 8 \
--kv-cache-dtype fp8_e4m3 \
--max-running-requests 1024 \
--cuda-graph-max-bs 1024 \
--stream-interval 20 \
--disable-radix-cache \
--model-loader-extra-config '{"enable_multithread_load": true}' \
--speculative-algorithm EAGLE3 \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-draft-model nvidia/gpt-oss-120b-Eagle3

client (accuracy):

OPENAI_API_KEY="test" \
python -m gpt_oss.evals \
--base-url http://127.0.0.1:30000/v1 \
--model openai/gpt-oss-120b \
--reasoning-effort high \
--n-threads 512 \
--eval aime25

client (benchmark, TP8, concurrency 8):

python3 -m sglang.bench_serving \
--model openai/gpt-oss-120b \
--backend sglang \
--dataset-name random \
--max-concurrency 8 \
--num-prompts 80 \
--random-input-len 1024 \
--random-output-len 1024 \
--random-range-ratio 1.0

Accuracy Results

PR

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20260325_193213', 'metric': 0.9125}]

PR with eagle

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20260325_210444', 'metric': 0.9083333333333333}]

main

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20260325_200520', 'metric': 0.9166666666666666}]

main with eagle

[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20260325_212946', 'metric': 0.9041666666666667}]

Perf (GPT-OSS-120B, TP8, concurrency 8)

PR: about 7% perf gain

Median TPOT (ms):                        2.80

main

Median TPOT (ms):                        3.02

Eagle Accept length

PR:

Accept length:                           2.05

main:

Accept length:                           1.99

Modifications

  • trtllm_mha_backend.py: support the core rope_quantize_fp8_append_paged_kv_cache kernel
  • gpt_oss.py: defer RoPE into the attention backend
  • radix_attention.py: defer RoPE into the attention backend
  • environ.py: add SGLANG_ENABLE_FLASHINFER_ROPE_FUSION, disabled by default
  • test_trtllm_mha_backend.py: test the trtllm_mha backend, covering basic and RoPE-fusion functionality
  • test_gpt_oss_models_rope_fusion.py: test GPT-OSS end-to-end accuracy with RoPE fusion enabled
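The gpt_oss.py / radix_attention.py change ("defer RoPE into the attention backend") can be pictured roughly as below. The class and method names are hypothetical stand-ins, not sglang's actual interface: the model rotates q/k eagerly only when the backend cannot fuse RoPE, and otherwise passes unrotated q/k plus positions down so the fused kernel can rotate them itself.

```python
# Hypothetical sketch of deferring RoPE into the attention backend.
# Class/method names are illustrative, not sglang's real interface.

class BaseAttnBackend:
    def support_rope_fusion(self) -> bool:
        return False

    def forward(self, q, k, positions, apply_rope):
        # Unfused path: q/k arrive already rotated by the model.
        return ("unfused", q, k)

class TrtllmMHABackend(BaseAttnBackend):
    def support_rope_fusion(self) -> bool:
        return True

    def forward(self, q, k, positions, apply_rope):
        # Fused path: rotation happens inside the backend, standing in for
        # the rope_quantize_fp8_append_paged_kv_cache kernel.
        q, k = apply_rope(q, k, positions)
        return ("fused", q, k)

def model_attention(q, k, positions, apply_rope, backend):
    """Model-side dispatch: rotate eagerly only if the backend cannot fuse RoPE."""
    if not backend.support_rope_fusion():
        q, k = apply_rope(q, k, positions)
    return backend.forward(q, k, positions, apply_rope)
```

Either way the attention output is computed from rotated q/k; only where the rotation runs changes, which is what lets the fused kernel absorb it.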

Checklist


@github-actions github-actions bot added the blackwell SM100/SM120 label Dec 24, 2025
@elvischenv elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 5e5c50f to 7cc00cb Compare February 7, 2026 12:42
@elvischenv elvischenv marked this pull request as ready for review February 7, 2026 12:43
@elvischenv elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 7cc00cb to 191dcf2 Compare February 24, 2026 04:45
@elvischenv elvischenv requested a review from HaiShaw as a code owner February 24, 2026 04:45
@elvischenv elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from 191dcf2 to e59267f Compare February 26, 2026 03:34
@nvpohanh
Copy link
Collaborator

This can be reviewed together with #19451. They are very similar, except that one is for trtllm_mha and the other is for trtllm_mla.

@Fridge003
Copy link
Collaborator

For the accuracy results, which model are you testing on?
Can you please post accuracy results for MTP, to make sure its acceptance length doesn't drop?

        return None

    def support_rope_fusion(self) -> bool:
        """Check if the current backend supports RoPE fusion."""
Collaborator

Instead of adding this method to the base class, can we control this fusion with an environ flag?
It is now set to False by default; after the feature stabilizes, it can be turned on by default.
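The flag gating suggested here could look like the sketch below. Only the SGLANG_ENABLE_FLASHINFER_ROPE_FUSION variable comes from the PR; the helper names and the default value of "0" are assumptions for illustration.

```python
import os

def rope_fusion_enabled() -> bool:
    # Hypothetical helper: disabled unless the user explicitly opts in;
    # the default could be flipped once the feature has stabilized.
    return os.environ.get("SGLANG_ENABLE_FLASHINFER_ROPE_FUSION", "0") == "1"

def use_rope_fusion(backend) -> bool:
    # Both the env flag and the backend capability must agree before the
    # fused kernel path is taken.
    return rope_fusion_enabled() and backend.support_rope_fusion()
```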

@elvischenv elvischenv force-pushed the elvischenv/gpt-oss_rope_quant_kv branch from e59267f to dc6a0ac Compare March 26, 2026 03:48
@elvischenv elvischenv requested a review from Ying1123 as a code owner March 26, 2026 03:48
@elvischenv
Contributor (Author)

@Fridge003 Updated the testing results in the PR description. This PR currently depends on a Flashinfer PR flashinfer-ai/flashinfer#2792 to fix the compatibility issue with piecewise cudagraph.


Labels

blackwell SM100/SM120
